Abstract. This paper first analyzes the resolution complexity of two random CSP models
(i.e. Model RB/RD) for which we can establish the existence of phase transitions and identify 
the threshold points exactly. By encoding CSPs into CNF formulas, it is proved that almost 
all instances of Model RB/RD have no tree-like resolution proofs of less than exponential size. 
Thus, we not only introduce new families of CNF formulas hard for resolution, which is a central 
task of Proof-Complexity theory, but also propose models with both many hard instances and 
exact phase transitions. Then, the implications of such models are addressed. It is shown both 
theoretically and experimentally that an application of Model RB/RD might be in the generation 
of hard satisfiable instances, which is not only of practical importance but also related to some 
open problems in cryptography such as generating one-way functions. Subsequently, a further 
theoretical support for the generation method is shown by establishing exponential lower bounds 
on the complexity of solving random satisfiable and forced satisfiable instances of RB/RD near
the threshold. Finally, conclusions are presented, as well as a detailed comparison of Model 
RB/RD with the Hamiltonian cycle problem and random 3-SAT, which, respectively, exhibit 
three different kinds of phase transition behavior in NP-complete problems.

1. Introduction



Over the past ten years, the study of phase transition phenomena has been one of the most exciting 
areas in computer science and artificial intelligence. Numerous empirical studies suggest that for 
many NP-complete problems, as a parameter is varied, there is a sharp transition from 1 to 0
at a threshold point with respect to the probability of a random instance being soluble. More
interestingly, the hardest instances to solve are concentrated in the sharp transition region. As is well
known, finding ways to generate hard instances for a problem is important both for understanding 
the complexity of the problem and for providing challenging benchmarks for experimental evaluation 
of algorithms [12]. So the finding of phase transition phenomena in computer science not only 
gives a new method to generate hard instances but also provides useful insights into the study of 
computational complexity from a new perspective. 

Although tremendous progress has been made in the study of phase transitions, there is still 
some lack of research about the connections between the threshold phenomena and the generation 
of hard instances, especially from a theoretical point of view. For example, some problems can 
be used to generate hard instances but the existence of phase transitions in such problems has 
not been proved. One such example is the well-studied random 3-SAT. A theoretical result by
Chvatal and Szemeredi [10] shows that for random 3-SAT, no short proofs exist in general, which
means that almost all proofs for this problem require exponential resolution lengths. Experimental 
results further indicate that instances from the phase transition region of random 3-SAT tend to 
be particularly hard to solve [25]. Since the early 1990's, considerable efforts have been put into 
random 3-SAT, but until now, the existence of the phase transition phenomenon in this problem 
has not been established, although recently, Friedgut [14] made tremendous progress in proving 
that the width of the phase transition region narrows as the number of variables increases. On 
the other hand, for some problems with proved phase transitions, it was found either theoretically 
or experimentally that instances generated by these problems are easy to solve or easy in general. 
Such examples include random 2-SAT, the Hamiltonian cycle problem and random 2+p-SAT (0 < p <
0.4). For random 2-SAT, Chvatal and Reed [11] and Goerdt [20] proved that the phase transition
phenomenon will occur when the ratio of clauses to variables is 1. But we know that 2-SAT is in
the class P, i.e. it can be solved in polynomial time, implying that random 2-SAT cannot be used to
generate hard instances. For the Hamiltonian cycle problem, which is NP-complete, Komlos and
Szemeredi [22] not only proved the existence of the phase transition in this problem but also gave 
the exact location of the transition point. However, both theoretical results [9] and experimental 
results [32] suggest that generally, the instances produced by this problem are not hard to solve. 
Different from the above two problems, random 2+p-SAT [30] was first proposed as an attempt to 
interpolate between the polynomial time problem random 2-SAT with p = 0 and the NP-complete
problem random 3-SAT with p = 1. It is not hard to see that random 2+p-SAT is in fact NP-complete
for p > 0. The phase transition behavior in this problem with 0 < p < 0.4 was established
by Achlioptas et al. and the exact location of the threshold point was also obtained [1]. But it
was further shown that random 2+p-SAT is essentially similar to random 2-SAT when 0 < p < 0.4
with the typical computational cost scaling linearly with the number of variables [29].

As mentioned before, from a computational theory point of view, what attracts people most 
in the study of phase transitions is the finding of many hard instances in the phase transition 
region. Hence, starting from this point, we can say that the problem models which cannot be used
to generate random hard instances are not as interesting for study as random 3-SAT. However,
until now, for the models with many hard instances, e.g. random 3-SAT, the existence of phase
transitions has not been established, let alone the exact location of the threshold points. So,
from a theoretical perspective, we still do not have sufficient evidence to support the long-standing 
observation that there exists a close relation between the generation of many hard instances and the 
threshold phenomena, although this observation opened the door for, and has greatly advanced the 
study of phase transitions in the last decade. From the discussion above, an interesting question 
naturally arises: whether there exist models with both proved phase transitions and many hard 
instances and, if so, what are the implications of such models. 

Recently, to overcome the trivial asymptotic insolubility of the previous random CSP models, 
Xu and Li [33] proposed a new CSP model, i.e. Model RB, which is a revision to the standard 
Model B. It was proved that the phase transitions from solubility to insolubility do exist for Model 
RB as the number of variables approaches infinity. Moreover, the threshold points at which the 
phase transitions occur are also known exactly. Based on previous experiments and by relating 
the hardness of Model RB to Model B, it has already been shown that Model RB abounds with 
hard instances in the phase transition region. In this paper, we will first propose a random CSP 
model, called Model RD, along the same line as for Model RB. Then, by encoding CSPs into CNF 
formulas, we will prove that almost all instances of Model RB/RD have no tree-like resolution 
proofs of less than exponential size. This means that Model RB/RD are hard for all popular CSP 
algorithms because such algorithms are essentially based on tree-like resolutions [24]. Therefore,
we not only introduce new families of CNF formulas hard for resolution, which is a central task
of Proof-Complexity theory, but also propose models with both many hard instances and exact 
phase transitions. More importantly, it will be shown that an application of RB/RD might be 
in the generation of hard satisfiable instances, which is not only of significance for experimental 
studies, but also of interest to the theoretical computer science community. Finally, exponential 
lower bounds will be established for random satisfiable and forced satisfiable instances of RB/RD 
near the threshold. 

2. Model RB and Model RD 

A Constraint Satisfaction Problem, or CSP for short, consists of a set of variables, a set of possible 
values for each variable (its domain) and a set of constraints defining the allowed tuples of values 
for the variables (a well-studied special case of it is SAT). The CSP is a fundamental problem in 
Artificial Intelligence, with a distinguished history and many applications, such as in knowledge 
representation, scheduling and pattern recognition. To compare the efficiency of different CSP 
algorithms, some standard random CSP models have been widely used experimentally to generate 
benchmark instances in the past decade. For the most widely used CSP model (i.e. standard Model 
B), Achlioptas et al. [2] proved that except for a small range of values of the constraint tightness, 
almost all instances generated are unsatisfiable as the number of variables approaches infinity. This 
result, as shown in [19], implies that most previous experimental results about random CSPs are 
asymptotically uninteresting. However, it should be noted that Achlioptas et al.'s result holds 
under the condition of fixed domain size and so is applicable only when the number of variables 
is overwhelmingly larger than the domain size. But in fact, it can be observed that the domain 
size, compared to the number of variables, is not very small in most experimental CSP studies. 
This, in turn, explains why there is a big gap between Achlioptas et al.'s theoretical result and 
the experimental findings about the phase transition behavior in random CSPs. Motivated by the 
observation above, and to overcome the trivial asymptotic insolubility of the previous random CSP 
models, Xu and Li [33] proposed an alternative CSP model as follows. 

Model RB: First, we select with repetition $m = rn \ln n$ random constraints. Each random
constraint is formed by selecting without repetition k of n variables, where $k \ge 2$ is an integer.
Next, for each constraint we uniformly select without repetition $q = p \cdot d^k$ incompatible tuples of
values, i.e., each constraint contains exactly $(1 - p) \cdot d^k$ allowed tuples of values, where $d = n^{\alpha}$ is
the domain size of each variable and $\alpha > 0$ is a constant.
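The generation procedure of Model RB can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' generator: the function name `generate_model_rb`, the representation of a constraint as a pair (scope, set of incompatible tuples), and the rounding of d, m and q to the nearest integers are assumptions made here.

```python
import itertools
import math
import random


def generate_model_rb(n, k, alpha, r, p, seed=None):
    """Sample one Model RB instance (hypothetical helper, for illustration).

    Returns the domain size d and a list of constraints, each given as
    (scope, nogoods): k distinct variables and a set of incompatible tuples.
    """
    rng = random.Random(seed)
    d = round(n ** alpha)              # domain size d = n^alpha (rounded, an assumption)
    m = round(r * n * math.log(n))     # number of constraints m = r n ln n (rounded)
    q = round(p * d ** k)              # incompatible tuples per constraint (rounded)
    all_tuples = list(itertools.product(range(d), repeat=k))
    constraints = []
    for _ in range(m):                 # constraints are drawn with repetition
        scope = tuple(rng.sample(range(n), k))       # k of n variables, no repetition
        nogoods = set(rng.sample(all_tuples, q))     # q distinct incompatible tuples
        constraints.append((scope, nogoods))
    return d, constraints
```

Enumerating all $d^k$ tuples is only practical for small k and d; it is done here to keep the sketch short.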

Note that the way of generating random instances for Model RB is almost the same as that for 
Model B. However, like the N-queens problem and Latin square, the domain size of Model RB is 
not fixed but polynomial in the number of variables. It is proved that Model RB not only avoids 
the trivial asymptotic behavior but also has exact phase transitions. More precisely, the following 
theorems hold for Model RB, where Pr(Sat) denotes the probability that a random CSP instance 
generated by Model RB is satisfiable. 

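Theorem 1 (Xu and Li [33]) Let $r_{cr} = -\frac{\alpha}{\ln(1-p)}$. If $\alpha > \frac{1}{k}$, $0 < p < 1$ are two constants and
k, p satisfy the inequality $k \ge \frac{1}{1-p}$, then

$$\lim_{n \to \infty} \Pr(\mathrm{Sat}) = 1 \text{ when } r < r_{cr},$$
$$\lim_{n \to \infty} \Pr(\mathrm{Sat}) = 0 \text{ when } r > r_{cr}.$$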

Theorem 2 (Xu and Li [33]) Let $p_{cr} = 1 - e^{-\alpha/r}$. If $\alpha > \frac{1}{k}$, $r > 0$ are two constants and k, $\alpha$
and r satisfy the inequality $k e^{-\alpha/r} \ge 1$, then

$$\lim_{n \to \infty} \Pr(\mathrm{Sat}) = 1 \text{ when } p < p_{cr},$$
$$\lim_{n \to \infty} \Pr(\mathrm{Sat}) = 0 \text{ when } p > p_{cr}.$$
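As a quick numerical companion to the two theorems, the thresholds can be evaluated directly. The sketch below assumes the threshold formulas read $r_{cr} = -\alpha/\ln(1-p)$ and $p_{cr} = 1 - e^{-\alpha/r}$, with side conditions $\alpha > 1/k$ and $k e^{-\alpha/r} \ge 1$; the function names are hypothetical and chosen for illustration only.

```python
import math


def r_critical(alpha, p):
    """Threshold on r for fixed tightness p (assumed form: -alpha / ln(1 - p))."""
    return -alpha / math.log(1.0 - p)


def p_critical(alpha, r):
    """Threshold on p for fixed density parameter r (assumed form: 1 - exp(-alpha / r))."""
    return 1.0 - math.exp(-alpha / r)


def theorem2_conditions_hold(k, alpha, r):
    """Check the assumed side conditions alpha > 1/k and k * exp(-alpha / r) >= 1."""
    return alpha > 1.0 / k and k * math.exp(-alpha / r) >= 1.0
```

For instance, with k = 2, $\alpha = 0.8$ and $r = 0.8/\ln(4/3)$, `p_critical` evaluates to 0.25 up to floating-point error, and the two thresholds are inverse to each other in the sense that `r_critical(alpha, p_critical(alpha, r))` returns r.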

As shown in [33], many instances generated following Model B in previous experiments can 
also be viewed as instances of Model RB, and more importantly, the experimental results for these 
instances agree well with the theoretical predictions for Model RB. Therefore, in this sense, we 
can say that Model B can still be used experimentally to produce benchmark instances. However, 
to guarantee an asymptotic phase transition behavior and to generate random hard instances, a 
natural and convenient way is to vary the values of CSP parameters under the framework of Model 
RB. Note that another standard CSP Model, i.e. Model D, is almost the same as Model B except 
that for every constraint, each tuple of values is selected to be incompatible with probability p. 
Similarly, we can make a revision to Model D and then get a new Model as follows. 

Model RD: First, we select with repetition $m = rn \ln n$ random constraints. Each random
constraint is formed by selecting without repetition k of n variables, where $k \ge 2$ is an integer.
Next, for each constraint, from $d^k$ possible tuples of values, each tuple is selected to be incompatible
with probability p, where $d = n^{\alpha}$ is the domain size of each variable and $\alpha > 0$ is a constant.
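The only difference from Model RB lies in how the incompatible tuples of a constraint are chosen, which a short Python sketch makes concrete. As before, this is an illustrative assumption-laden sketch: the name `generate_model_rd`, the (scope, nogoods) representation and the rounding of d and m are choices made here, not taken from the paper.

```python
import itertools
import math
import random


def generate_model_rd(n, k, alpha, r, p, seed=None):
    """Sample one Model RD instance (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    d = round(n ** alpha)              # domain size d = n^alpha (rounded, an assumption)
    m = round(r * n * math.log(n))     # number of constraints m = r n ln n (rounded)
    constraints = []
    for _ in range(m):
        scope = tuple(rng.sample(range(n), k))   # k of n variables, no repetition
        # Each of the d^k tuples is made incompatible independently with probability p.
        nogoods = {t for t in itertools.product(range(d), repeat=k)
                   if rng.random() < p}
        constraints.append((scope, nogoods))
    return d, constraints
```

In Model RD the number of incompatible tuples per constraint is therefore random with mean $p \cdot d^k$, whereas in Model RB it is fixed.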

Along the same line as in the proof for Model RB [33], we can easily prove that exact phase
transitions also exist for Model RD. More precisely, Theorem 1 and Theorem 2 hold for Model RD
too. In fact, it is exactly because the differences between Model RB and Model RD are very small 
that many properties hold for both of them and the proof techniques are also almost the same. So 
in this paper, we will discuss both models, denoted by Model RB/RD. 

Recently, there has been a growing theoretical interest in random CSPs, especially with respect 
to their phase transition behaviors [13, 16, 17, 27, 31, 35] and resolution complexity [18, 26, 28]. 
To discuss the resolution complexity of CSPs, we first need to encode a CSP instance into a CNF 
formula. In this paper we will adopt the encoding method used in [24]. For convenience, we give 
the outline of this method here. For each CSP variable u, we introduce d propositional variables, 
called domain variables, to represent assignments of values to u. There are three sets of clauses 
needed in the encoding, i.e. the domain clauses asserting that each variable must be assigned a 
value from its domain, the conflict clauses excluding assignments violating constraints and clauses 
asserting that each variable is assigned at most one value from its domain. 
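The three clause sets just described can be sketched as follows. This is a minimal illustration of such an encoding, not the exact construction of [24]: the DIMACS-style numbering of domain variables, the function name `encode_csp_to_cnf`, and the input format (a list of (scope, incompatible tuples) pairs) are assumptions.

```python
import itertools


def encode_csp_to_cnf(n, d, constraints):
    """Encode a CSP with n variables of domain size d into a list of clauses.

    The domain variable "u takes value v" is the positive integer u*d + v + 1;
    a negative integer denotes its negation (hypothetical numbering scheme).
    """
    def x(u, v):
        return u * d + v + 1

    clauses = []
    for u in range(n):
        # Domain clause: u is assigned at least one value.
        clauses.append([x(u, v) for v in range(d)])
        # At-most-one clauses: u is not assigned two different values.
        for v, w in itertools.combinations(range(d), 2):
            clauses.append([-x(u, v), -x(u, w)])
    for scope, nogoods in constraints:
        # Conflict clauses: each incompatible tuple is excluded.
        for tup in nogoods:
            clauses.append([-x(u, v) for u, v in zip(scope, tup)])
    return clauses
```

Under this sketch the widest initial clauses are the domain clauses, of width d.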

3. Resolution Lower Bounds for Model RB/RD 

In this section, we will analyze the resolution complexity of unsatisfiability proofs for Model RB/RD 
and get the following result. 

Theorem 3 Let P be a random CSP instance generated following Model RB/RD. Then,
almost surely, P has no tree-like resolutions of length less than $2^{\Omega(n)}$.

When we say that a property holds almost surely it means that this property holds with
probability tending to 1 as the number of variables approaches infinity.

The core of the proof for Theorem 3 is to show that almost surely there exists a clause with large 
width in every refutation. The width of a clause C, denoted by w(C), is the number of variables 
appearing in it. The width of a set of clauses is the maximal width of a clause in the set. The 
width of deriving a clause C from the formula F, denoted by $w(F \vdash C)$, is defined as the minimum
of the widths of all derivations of C from F. So, the width of refutations for F can be denoted by
$w(F \vdash 0)$. Ben-Sasson and Wigderson [8] gave the following theorem on size-width relations and
proposed a general strategy for proving width lower bounds for CNF formulas.

Theorem 4 (Ben-Sasson and Wigderson [8]) Let F be a CNF formula and $S_T(F)$ be the
minimal size of a tree-like refutation. Then we have

$$S_T(F) \ge 2^{(w(F \vdash 0) - w(F))}.$$

By extending Ben-Sasson and Wigderson's strategy, Mitchell [26] proved exponential resolution 
lower bounds for some random CSPs of fixed domain size. In what follows, to obtain lower bounds 
on width for RB/RD, we will basically use the same strategy as in [26], but adapt it to handle 
random CSPs with growing domains. First, we prove the following local sparse property for RB/RD. 

Lemma 1 Let P be a random CSP instance generated by Model RB/RD. There is a constant
$c > 0$ such that almost surely every sub-problem of P with size $s \le cn$ has at most $b = \beta s \ln n$
constraints, where $\beta = \frac{\alpha}{6k \ln \frac{1}{1-p}}$.

Proof: As mentioned in [27], this is a standard type of argument in random graph theory.
Similarly, we consider the number of sub-problems on s variables with $b = \beta s \ln n$ constraints for
$0 < s \le cn$. There are $\binom{n}{s}$ possible choices for the variables and $\binom{m}{b}$ for the constraints. Given such
choices, the probability that all the b constraints are in the s variables is not greater than $\left(\frac{s}{n}\right)^{kb}$.
So, the expected number of such sub-problems is at most

$$\binom{n}{s}\binom{m}{b}\left(\frac{s}{n}\right)^{kb}.$$

For sufficiently large n, there exists a constant $c_1 > 0$ such that

$$\binom{n}{s}\binom{m}{b}\left(\frac{s}{n}\right)^{kb} \le \left[ n^{c_1} \left(\frac{s}{n}\right)^{(k-1)\beta \ln n - 1} \right]^{s}.$$

Let c be a sufficiently small positive constant (depending only on $c_1$, k and $\beta$). For $0 < s \le cn$, it
follows from the above inequality that

$$\binom{n}{s}\binom{m}{b}\left(\frac{s}{n}\right)^{kb} \le \left(\frac{1}{n^2}\right)^{s} \le \frac{1}{n^2}.$$

Thus the expected number of such sub-problems with $s \le cn$ is at most

$$\sum_{s=1}^{cn} \binom{n}{s}\binom{m}{b}\left(\frac{s}{n}\right)^{kb} \le cn \cdot \frac{1}{n^2} = o(1).$$

This finishes the proof.

The following two definitions will be of use later.

Definition 1 Consider a variable u and i constraints associated with u. In these i constraints,
all the variables except u have already been assigned values from their domains. We call this an
i-constraint assignment tuple, denoted by $T_{i,u}$.

Definition 2 Given a variable u and an i-constraint assignment tuple $T_{i,u}$, we assign a value
v to u from its domain. So, all the variables in the i constraints of $T_{i,u}$ have been assigned values.
If at least one constraint in $T_{i,u}$ is violated by these values, then we say that the value v of u is
flawed by $T_{i,u}$. If all the values of u in its domain are flawed by $T_{i,u}$, then we say that the variable
u is flawed by $T_{i,u}$, and $T_{i,u}$ is called a flawed i-constraint assignment tuple.

Lemma 2 Let P be a random CSP instance generated by Model RB/RD. Almost surely, there
does not exist a flawed i-constraint assignment tuple $T_{i,u}$ in P with $i \le 3k\beta \ln n$.

Proof: Now consider an i-constraint assignment tuple $T_{i,u}$ with $i \le 3k\beta \ln n$. It is easy to see
that the probability that $T_{i,u}$ is flawed increases with the number of constraints i. Recall that in Model
RD, for every constraint, each tuple of values is selected to be incompatible with probability p. So,
given a value v of u, the probability that v is flawed by $T_{i,u}$ is

$$1 - (1-p)^{i}.$$

Thus the probability that all the $d = n^{\alpha}$ values of u are flawed by $T_{i,u}$, i.e. the probability of $T_{i,u}$
being flawed, is

$$\left[1 - (1-p)^{i}\right]^{d}.$$

Note that $\beta = \frac{\alpha}{6k \ln \frac{1}{1-p}}$, so that $(1-p)^{3k\beta \ln n} = n^{-\alpha/2}$. Thus for $0 < i \le 3k\beta \ln n$, we have

$$\Pr(T_{i,u} \text{ is flawed}) \le \left[1 - (1-p)^{3k\beta \ln n}\right]^{d} = \left[1 - \frac{1}{n^{\alpha/2}}\right]^{n^{\alpha}} \le e^{-n^{\alpha/2}}.$$

The above analysis only applies to Model RD. For Model RB, such an analysis is much more
complicated, and so we leave it in the appendix. Recall that there are n variables and $m = rn \ln n$
constraints. So the number of possible choices for i-constraint assignment tuples is at most

$$n \binom{m}{i} d^{(k-1)i}.$$

For $i \le 3k\beta \ln n$, when n is sufficiently large, there exists a constant $c_2 > 0$ such that

$$n \binom{m}{i} d^{(k-1)i} \le e^{c_2 \ln^2 n}.$$

Thus the expected number of flawed i-constraint assignment tuples with $i \le 3k\beta \ln n$ is at most

$$\sum_{i=1}^{3k\beta \ln n} n \binom{m}{i} d^{(k-1)i} \Pr(T_{i,u} \text{ is flawed}) \le \sum_{i=1}^{3k\beta \ln n} e^{c_2 \ln^2 n} \Pr(T_{i,u} \text{ is flawed})$$
$$= e^{c_2 \ln^2 n} \cdot O\left(e^{-n^{\alpha/2}}\right) \cdot 3k\beta \ln n = o(1).$$

This implies that almost surely, there does not exist a variable u and an i-constraint assignment
tuple $T_{i,u}$ with $i \le 3k\beta \ln n$ such that u is flawed by $T_{i,u}$. This is exactly what we need and so we
are done. □
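The key quantity in this proof is easy to evaluate numerically. The sketch below assumes the Model RD flaw probability $[1 - (1-p)^i]^d$ and the constant $\beta = \alpha / (6k \ln\frac{1}{1-p})$; both function names are hypothetical. It illustrates why the choice of $\beta$ gives $(1-p)^{3k\beta \ln n} = n^{-\alpha/2}$, which is what drives the super-polynomially small bound.

```python
import math


def beta(alpha, k, p):
    """Constant beta, assumed here to be alpha / (6 k ln(1 / (1 - p)))."""
    return alpha / (6.0 * k * math.log(1.0 / (1.0 - p)))


def flawed_probability(p, i, d):
    """Probability (Model RD) that all d values of u violate at least one of i constraints."""
    return (1.0 - (1.0 - p) ** i) ** d
```

For fixed p and d the value of `flawed_probability` grows with i, which is why the proof only needs to bound it at the largest admissible i.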

Lemma 3 Let P be a random CSP instance generated by Model RB/RD. Almost surely, every
sub-problem of P with size at most cn is satisfiable.

Proof: Here by the size of a problem we mean the number of variables in this problem. We
will prove this lemma by contradiction. Assume that we have an unsatisfiable sub-problem of size
at most cn. Thus we can get a minimum sized unsatisfiable sub-problem with size $s \le cn$, denoted
by $P_1$. From Lemma 1 we know that almost surely $P_1$ has at most $\beta s \ln n$ constraints. Thus there
exists a variable u in $P_1$ with degree at most $k\beta \ln n$, i.e. the number of constraints in $P_1$ associated
with u is not greater than $k\beta \ln n$. Removing u and the constraints associated with u from $P_1$, we
get a sub-problem $P_2$. By minimality of $P_1$, we know that $P_2$ is satisfiable, and so there exists an
assignment satisfying $P_2$. Suppose that the variables in $P_2$ have been assigned values by such an
assignment. Now consider the variable u and the i constraints associated with u, where $i \le k\beta \ln n$.
By Definition 1 this constitutes an i-constraint assignment tuple for u, denoted by $T_{i,u}$. Recall that
$P_1$ is unsatisfiable. This means that no value of u can satisfy all the i constraints. That is to say,
the variable u is flawed by $T_{i,u}$. Therefore, if a sub-problem of size at most cn is unsatisfiable, then,
almost surely, there is a variable u and an i-constraint assignment tuple $T_{i,u}$ such that u is flawed
by $T_{i,u}$, where $i \le k\beta \ln n$. This is in contradiction with Lemma 2 and so finishes the proof. □

Now we will prove that there almost surely exists a complex clause in the refutation proofs of
Model RB/RD. The complexity of a clause was defined in [26] by Mitchell, i.e. for any refutation
$\pi$, the complexity of a clause C in $\pi$, denoted by $\mu(C)$, is the size of the smallest sub-problem $\Pi$
such that C can be derived by resolution from $\phi(\Pi)$. Along the same line as in the proof of [26],
we have the following lemma.

Lemma 4 Let P be a random CSP instance generated by Model RB/RD. Almost surely, every
refutation $\pi$ of $\phi(P)$ has a clause C of complexity $\frac{c}{2}n \le \mu(C) \le cn$.

Proof: For this proof, please refer to [26]. □

Lemma 5 Let C be a clause of complexity $\frac{c}{2}n \le \mu(C) \le cn$. Then, almost surely, C has at
least $\frac{c}{6}n$ literals, i.e. $w(C) \ge \frac{c}{6}n$.

Proof: We will prove this by contradiction. For a CSP instance P, its CNF encoding is denoted
by $\phi(P)$. Let C be a clause of complexity $\frac{c}{2}n \le \mu(C) \le cn$ and $P_1$ be the smallest problem such that
$\phi(P_1) \models C$. Hence, the size of $P_1$ is at least $\frac{c}{2}n$ and at most cn. By Lemma 1, there are at most
$\beta cn \ln n$ constraints in $P_1$. So, there are at most $\frac{c}{3}n$ variables with degree greater than $3k\beta \ln n$.
Then, there are at least $\frac{c}{2}n - \frac{c}{3}n = \frac{c}{6}n$ variables in $P_1$ with degree at most $3k\beta \ln n$. We will prove
that for these variables, almost surely, there does not exist a variable such that no domain variable
of it appears in C. Now assume that we have a variable u in $P_1$ with degree $i \le 3k\beta \ln n$ and no
domain variable of it appears in C. Removing u and the constraints associated with it from $P_1$, we
get a sub-problem $P_2$. By minimality of $P_1$, we know that $\phi(P_2) \not\models C$. So we can find an assignment
satisfying $P_2$ but not satisfying C. Suppose that the propositional variables in $P_2$ and C have been
assigned values by such an assignment. Now consider the variable u and the constraints associated
with it. By Definition 1, this constitutes an i-constraint assignment tuple for u, denoted by $T_{i,u}$. By
assumption, no domain variable of u appears in C. So, assigning any value to u will not affect the
truth value of C. Recall that $\phi(P_1) \models C$ and C is false under the current assignment. Therefore, no
value of u can satisfy $\phi(P_1)$, i.e. setting any value to u will violate at least one constraint associated
with it. It follows that u is flawed by $T_{i,u}$, i.e. there exists a flawed i-constraint assignment tuple
with $i \le 3k\beta \ln n$. This is in contradiction with Lemma 2 and so we are done. □

Combining Lemma 4 and Lemma 5, we have that, for a random CSP instance P generated by
Model RB/RD, almost surely, $w(\phi(P) \vdash 0) \ge \frac{c}{6}n$. Now, by use of Theorem 4, we finish the proof.
One point worth mentioning is that when $\alpha > 1$, the initial width of clauses is greater than or
equal to the number of variables. In such a case, to make Theorem 4 applicable, we only need to
introduce some new variables and reduce the widths of domain clauses, which has no effect on our
results.
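The final step can be spelled out as a short calculation. This is a sketch under stated assumptions rather than the authors' own derivation: it takes the width lower bound to be linear, say $w(\phi(P) \vdash 0) \ge c'n$ for some constant $c' > 0$ (the symbol $c'$ is introduced here for illustration), and it assumes $0 < \alpha < 1$ and n large enough that the widest initial clauses are the domain clauses, of width $d = n^{\alpha}$.

```latex
S_T(\phi(P)) \;\ge\; 2^{\,w(\phi(P) \vdash 0) - w(\phi(P))}
            \;\ge\; 2^{\,c' n - n^{\alpha}}
            \;=\; 2^{\Omega(n)}
```

Since $n^{\alpha} = o(n)$ when $\alpha < 1$, subtracting the initial width does not change the exponential order of the bound.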

4. Generating Hard Satisfiable Instances 

As mentioned before, the finding of phase transitions in NP-complete problems provides a good
method for generating random hard instances which are very useful in the evaluation of algorithms.
In recent years, a remarkable advance in Artificial Intelligence has been the development
of incomplete algorithms for various kinds of problems. To evaluate the efficiency of such incomplete
algorithms, we need a source to generate only hard satisfiable instances [3]. However, since the
probability of being satisfiable is about 0.5 at the threshold point where the hardest instances are 
concentrated, the generator based on phase transitions will usually produce a mixture of satisfiable 
and unsatisfiable instances. So, it is interesting to study how the phase transition phenomenon can 
be used to generate hard satisfiable instances. Besides practical importance, more interestingly, 
the problem of generating random hard satisfiable instances is related to some open problems in 
cryptography, e.g. computing a one-way function, generating pseudo-random numbers and private 
key cryptography [12, 21, 23]. 

In fact, for constraint satisfaction and Boolean satisfiability problems, there is a natural strategy 
to generate instances that are guaranteed to have at least one satisfying assignment. The strategy 
is as follows [3]: first generate a random truth assignment t, and then generate a certain number of 
random constraints or clauses one by one to form a random instance, where any clause or constraint 
violating t will be rejected. The above strategy is very simple and can be easily implemented. But 
unfortunately, this strategy was proved to be unsuitable for random 3-SAT because it in fact 
produces a biased sampling of instances with many satisfying assignments (clustered around t), 
and experiments also show that these instances are much easier to solve than random satisfiable 
instances [3]. In the following, for convenience, we will refer to the satisfiable instances generated using
this strategy as forced satisfiable instances.
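Applied to Model RB, the strategy above can be sketched as a rejection loop. This is an illustrative sketch, not the generator used in the experiments: the names `generate_forced_rb` and `satisfies`, the constraint representation, and the rounding of d, m and q are assumptions made here.

```python
import itertools
import math
import random


def generate_forced_rb(n, k, alpha, r, p, seed=None):
    """Sample a forced satisfiable Model RB instance (hypothetical helper)."""
    rng = random.Random(seed)
    d = round(n ** alpha)
    m = round(r * n * math.log(n))
    q = round(p * d ** k)
    hidden = [rng.randrange(d) for _ in range(n)]    # random forced assignment t
    all_tuples = list(itertools.product(range(d), repeat=k))
    constraints = []
    while len(constraints) < m:
        scope = tuple(rng.sample(range(n), k))
        nogoods = set(rng.sample(all_tuples, q))
        if tuple(hidden[u] for u in scope) in nogoods:
            continue                                 # constraint violates t: reject it
        constraints.append((scope, nogoods))
    return d, hidden, constraints


def satisfies(assignment, constraints):
    """True if no constraint has its scope assigned an incompatible tuple."""
    return all(tuple(assignment[u] for u in scope) not in nogoods
               for scope, nogoods in constraints)
```

Each candidate constraint is rejected with probability about p, so the loop needs roughly m / (1 - p) draws on average.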

Now let us look further into the question of why the strategy fails for random 3-SAT. As defined in
[33, 34], an assignment pair $\langle t_1, t_2 \rangle$ is an ordered pair of two assignments $t_1$ and $t_2$. We say that
$\langle t_1, t_2 \rangle$ satisfies a CSP if and only if both $t_1$ and $t_2$ satisfy this CSP. Suppose that the number
of variables is n and the domain size is d. Then we have in total $d^n$ possible assignments, denoted
by $t_1, t_2, \ldots, t_{d^n}$, and $d^{2n}$ possible assignment pairs. Let $t_i$ be a forced satisfying assignment. Then
the expected number of solutions for forced satisfiable instances satisfying $t_i$, denoted by $E_f[N]$, is

$$E_f[N] = \sum_{j=1}^{d^n} \frac{\Pr[\langle t_i, t_j \rangle]}{\Pr[\langle t_i, t_i \rangle]},$$

where $\Pr[\langle t_i, t_j \rangle]$ denotes the probability that $\langle t_i, t_j \rangle$ satisfies a random instance. Note that
$E_f[N]$ should be independent of the choice of the forced satisfying assignment $t_i$. So we have

$$E_f[N] = \sum_{1 \le i,j \le d^n} \frac{\Pr[\langle t_i, t_j \rangle]}{d^n \Pr[\langle t_i, t_i \rangle]} = \frac{E[N^2]}{E[N]},$$

where $E[N^2]$ and $E[N]$ are, respectively, the second moment and the first moment of the number
of solutions for instances generated randomly. For random 3-SAT, it follows from the result on
satisfying assignment pairs in [34] that asymptotically, $E[N^2]$ is exponentially greater than $E^2[N]$.
This conclusion can also be found in [4]. Thus, the expected number of solutions for forced satisfiable
instances is exponentially larger than that for random satisfiable instances, which gives a good
theoretical explanation of why, for random 3-SAT, the strategy is highly biased towards generating
instances with many solutions.

We now consider the problem of generating satisfiable instances for Model RB/RD using the
same strategy. Recall that when we established the exact phase transitions for RB/RD [33], it
was proved that $E[N^2]/E^2[N]$ is asymptotically equal to 1 below the threshold, where almost all
instances are satisfiable, i.e. $E[N^2]/E^2[N] \approx 1$ for $r < r_{cr}$ or $p < p_{cr}$. So, we have that for
RB/RD, the expected number of solutions for forced satisfiable instances below the threshold is
asymptotically equal to that for random satisfiable instances, i.e. $E_f[N] = E[N^2]/E[N] \approx E[N]$.
In other words, the strategy has almost no effect on the number of solutions for RB/RD and thus
will not lead to a biased sampling of instances with many solutions.

In addition to the analysis above, we can also study the influence of the strategy on the distribution
of solutions with respect to the forced satisfying assignment. Based on the definition of similarity
number in [33], we first define a distance on the assignments as $d^f(t_1, t_2) = 1 - S^f(\langle t_1, t_2 \rangle)/n$,
where $t_1, t_2$ are two assignments, n is the total number of variables and $S^f(\langle t_1, t_2 \rangle)$ is equal to the
number of variables at which the two assignments take the identical values. It is easy to see that
$0 \le d^f(t_1, t_2) \le 1$. Let $E_f[X]$ and $E[X]$ respectively denote, for forced satisfiable instances and
random satisfiable instances, the expected number of solutions with a fixed distance $d_t$ from the
forced satisfying assignment. By an analysis similar to that in [33] (pp. 96-97), we have



Indeed, it can be shown, from the results in [33] (pp. 97-98), that $E_f[X]$, for $r < r_{cr}$ or $p < p_{cr}$,
will be asymptotically maximized when $d_t$ takes the largest possible value, i.e. $d_t = 1$. For random
satisfiable instances of RB/RD, we have

$$E[X] = \binom{n}{n d_t} (d-1)^{n d_t} (1-p)^{rn \ln n} = \exp\left[ n \ln n \left( r \ln(1-p) + \alpha d_t \right) + O(n) \right].$$

It is straightforward to see that the same pattern holds for this case, i.e. $E[X]$ will be asymptotically
maximized when $d_t = 1$. So, intuitively speaking, for RB/RD, given an assignment t, for both
forced satisfiable instances satisfying t and random satisfiable instances, most solutions distribute
in a place far from t. This further indicates that the strategy has little effect on the distribution
of solutions for RB/RD, and so it will not be biased towards generating instances with many
solutions around the forced satisfying assignment. For random 3-SAT, similarly, we have

$$E_f[X] = f_1(n) \exp\left\{ n \left[ -d_t \ln d_t - (1-d_t) \ln(1-d_t) + r \ln \frac{6 + (1-d_t)^3}{7} \right] \right\}$$

and

$$E[X] = \binom{n}{n d_t} \left( \frac{7}{8} \right)^{rn} = f_2(n) \exp\left\{ n \left[ -d_t \ln d_t - (1-d_t) \ln(1-d_t) + r \ln \frac{7}{8} \right] \right\},$$

where $f_1(n)$ and $f_2(n)$ are two polynomial functions. It follows from the results in [34] that as r (the
ratio of clauses to variables) approaches 4.25, $E_f[X]$ and $E[X]$ will be asymptotically maximized
when $d_t \approx 0.24$ and $d_t = 0.5$ respectively. This means, in contrast to RB/RD, that compared
with random satisfiable instances, most solutions of forced satisfiable instances distribute in a place
much closer to the forced satisfying assignment when r is near the threshold.
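The two maximizers quoted for random 3-SAT can be checked numerically. The sketch below assumes that the exponents of $E_f[X]$ and $E[X]$ are, per variable, $-d_t \ln d_t - (1-d_t)\ln(1-d_t) + r \ln\frac{6 + (1-d_t)^3}{7}$ and $-d_t \ln d_t - (1-d_t)\ln(1-d_t) + r \ln\frac{7}{8}$ respectively; the function names and the simple grid search are choices made here for illustration.

```python
import math


def entropy_term(x):
    """Binary entropy (natural log) of the distance x, for 0 < x < 1."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def forced_exponent(x, r):
    """Assumed per-variable exponent of E_f[X] for forced satisfiable 3-SAT."""
    return entropy_term(x) + r * math.log((6 + (1 - x) ** 3) / 7)


def random_exponent(x, r):
    """Assumed per-variable exponent of E[X] for random 3-SAT."""
    return entropy_term(x) + r * math.log(7 / 8)


def argmax_on_grid(f, r, steps=10000):
    """Return the grid point in (0, 1) at which f(., r) is largest."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda x: f(x, r))
```

With r = 4.25, the grid search places the maximizer of the forced exponent between 0.24 and 0.25 and that of the random exponent at 0.5, in line with the values stated above.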

Note that the number and the distribution of solutions are the two most important factors 
determining the cost of solving satisfiable instances. So, we can expect, from the above analysis, 
that for RB/RD, the hardness of solving forced satisfiable instances should be similar to that of 
solving random satisfiable instances. More interestingly, it therefore seems that we can, based on the 
hardness of RB/RD, propose a new method to generate hard satisfiable instances, i.e. generating 
forced satisfiable instances of RB/RD with a large number of variables near the threshold identified
exactly by Theorem 1 or Theorem 2. Experimental results have further confirmed this idea^2. It
is shown, in one experiment for RB with k = 2, n = 30, d = 15 and m = 250, that the mean
time of solving forced satisfiable instances near the threshold is only slightly smaller (11 percent)
than that of solving random satisfiable instances with the same parameters^3. More importantly,
experiments for RB also indicate that the hardness of solving forced satisfiable instances grows
exponentially with the number of variables^4 near the threshold, and we can, in fact, generate
forced satisfiable instances appearing to be very hard to solve (for both complete and incomplete
algorithms) even when the number of variables is only moderately large (e.g. k = 2, n = 59, $\alpha = 0.8$
and $r = 0.8/\ln\frac{4}{3}$ with constraint tightness $p = p_{cr} = 0.25$ computed by Theorem 2, or equivalently
expressed as k = 2, n = 59, d = 26 and m = 669 with the same tightness^5)^6. Although there
have been some other ways to generate hard satisfiable instances empirically, e.g. the quasigroup 
method [3], we think that the simple and natural method presented in this paper, based on models 
(i.e. Model RB/RD) with exact phase transitions and many hard instances, should be well worth 
further investigation. 
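
The two equivalent parameter sets quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: it takes d = n^α and m = r n ln n with rounding to the nearest integer (as in footnote 5), and it assumes the threshold of Theorem 2 has the form p_cr = 1 - exp(-α/r), which is consistent with the quoted value p_cr = 0.25. The function names are hypothetical.

```python
import math

def rb_parameters(n, alpha, r):
    """Domain size d = n^alpha and constraint count m = r * n * ln(n), rounded."""
    d = round(n ** alpha)
    m = round(r * n * math.log(n))
    return d, m

def critical_tightness(alpha, r):
    """Assumed form of the Theorem 2 threshold: p_cr = 1 - exp(-alpha / r)."""
    return 1 - math.exp(-alpha / r)

alpha = 0.8
r = 0.8 / math.log(4 / 3)
d, m = rb_parameters(59, alpha, r)
p_cr = critical_tightness(alpha, r)
print(d, m, round(p_cr, 4))
```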

2 We thank Dr. Christophe Lecoutre and Liu Yang very much for performing the experiments. 

3 As specified by the conditions of Theorem 2, for exact phase transitions to hold, the values of α and
r should not be small. So, we should choose dense CSPs with a large domain.

4 According to the definitions of RB/RD and Theorems 1 and 2, the parameters α, r and p should be fixed
when n increases. The values of the threshold points can also be obtained from these two theorems.

5 If non-integer values occur in the computation of d and m from n, α and r, then we round them to the
nearest integers.

6 Benchmarks of Model RB (in both SAT and CSP format) are available at
www.nlsde.buaa.edu.cn/~kexu/benchmarks/benchmarks.htm.
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5. Exponential Lower Bounds for Satisfiable Instances of Model RB/RD 

For random CSP instances of RB/RD, we know from Theorems 1 and 2 that, almost surely, they
are satisfiable below the threshold and unsatisfiable above the threshold. For satisfiable instances,
there are no resolution proofs, or, if any, the resolution proofs are of infinite length. Therefore, the
exponential resolution lower bounds established in Theorem 4 are of interest only for instances
above the threshold. Also, in many other cases, exponential lower bounds have been shown only
for unsatisfiable instances, and it seems quite difficult to derive such lower bounds for satisfiable
instances. Recent progress in this direction, made by Achlioptas et al. [5], is that exponential
lower bounds have been established for certain natural DPLL algorithms on some provably
satisfiable instances of random k-SAT for k ≥ 4. In this section, we analyze the complexity of solving
RB/RD below the threshold and obtain the following results.

Theorem 5 Given a random CSP instance of RB/RD with $r_{cr} - \epsilon_r < r < r_{cr}$ or
$p_{cr} - \epsilon_p < p < p_{cr}$,



where e r = — 



' 12 fc , 



and e p = [l - exp (-7(1 - 



21 



are two positive constants, we uniformly select without repetition $\frac{c}{12}n$ variables, and assign each of
these variables a value from its domain at random. If these values do not violate any constraint,
then, almost surely, the residual formula is unsatisfiable and has no tree-like resolution proofs of
less than exponential size.

Proof: Let E[X] denote the expected number of assignments satisfying the residual formula.
By assumption, the partial assignment to the $\frac{c}{12}n$ variables does not violate any constraint. Then

$$E[X] = d^{\,n-\frac{c}{12}n}\left[1-p\left(1-\frac{c^k}{12^k}\right)\right]^{rn\ln n}.$$



For $r_{cr} - \epsilon_r < r < r_{cr}$, we have



E[X] < n an ^-^ 



(r C r—£r)n\nn 



< exp 



— e r In ( 1 — p ( 1 — 



I2 k 



ac\ 
nmn 

12/ 



ac 



= exp ( — — nlnn ) = o(l). 



By Markov's inequality, the residual formula is almost surely unsatisfiable. For
the phase transition with respect to p, the proof is similar. Now we prove that for
the residual formula, any sub-problem of size at most cn is almost surely satisfiable. Based on the
proofs of Lemmas 2 and 3, we only need to show that for any sub-problem of size $1 \le s \le cn$
containing unassigned variables, there almost surely exists an unassigned variable with degree at
most $3k\beta\ln n$. Thus, it is sufficient to prove that for any sub-problem of size
$1+\frac{c}{12}n \le s \le cn+\frac{c}{12}n$ containing the $\frac{c}{12}n$ assigned variables, there almost
surely exists an unassigned variable with degree at most $3k\beta\ln n$. For such a sub-problem, the
probability that an unassigned variable has degree at least $3k\beta\ln n$ is not greater than

$$\binom{rn\ln n}{b}\left(\frac{k}{n}\right)^{b}\left(\frac{s}{n}\right)^{kb-b}, \quad \text{where } b = 3k\beta\ln n.$$



Then, the probability that all the unassigned variables have degrees at least $3k\beta\ln n$ is not greater
than



rn Inn 
b 



e \ kb—b 

) 
There are $\binom{n-\frac{c}{12}n}{s-\frac{c}{12}n}$ possible choices for such sub-problems. So the expected number of such
sub-problems with size $1+\frac{c}{12}n \le s \le cn+\frac{c}{12}n$ is at most



E 

■l+T 

cn 

s E 
* E 



rralnn\ / A;6\ ( 1 
6 JU 



n J \nJ 



kb- 



where b = 3kf3 In n 



e(n — ,%n) 



*- T2 n 



6 



re \ 3fc/31nn /a\3fc(fc-l)/91nn 



n / 



In the proof of Lemma 1, we define e (jf^j < nCl an d c < \ ex P ( — (frfp) • Substituting them 
into the above inequality, we get 



E 

S=l +T2 n 
cn +T2 

£ E 



re 



^3fe/31nn 3fc(fc _ 1)/31nn 

Vn/ 



3fcci 



1 



en- 



g3fc 23fc/31nn 



n 



— 3fcci— 6fe 



n+ -k n / 1 x 

E » ? 

-1 I c „ \ / 



c c 

where 1 H n < s < cn -\ n 

12 ~ ~ 12 



as required. Now for the residual formula, Lemmas 3 and 4 follow immediately. Recall that in 
Lemma 5, we prove that there are at least |n variables in Pi with degree at most 3k(3\nn. For the 
residual formula where -^n variables have been assigned values, there are at least -^n variables 
in Pi with degree at most 3kf3lnn. Similarly, we can prove that almost surely, there is a clause 
with at least -j^n literals for the residual formula. By Theorem 4, we finish the proof. Note that 
the constant c can be chosen to monotonically decrease with r or p. Here we can, therefore, take 
the value of c as that for r = r cr or p = p cr and try to make it as small as possible (in order to 
guarantee that e r and e p are two positive constants). □ 
Generally speaking, different search algorithms use different strategies to search for solutions.
Rather than focusing on specific algorithms, we relate the hardness of solving satisfiable
instances to that of solving unsatisfiable sub-problems, because if it takes a long time to solve the
sub-problems generated in the search process, then the original problem cannot be solved quickly
[24]. Theorem 5 indicates that for satisfiable instances of RB/RD below and close to the threshold,
if a resolution-based algorithm cannot detect any contradiction in the early stage of a search
branch, then the algorithm will, very likely, generate a large unsatisfiable sub-problem. As
a result, it will almost surely take exponential time to explore large subtrees to prove the
unsatisfiability of the sub-problem. Indeed, there are exponentially many large unsatisfiable
sub-problems. More precisely, it can be computed that the total number of residual formulas with
$\frac{c}{12}n$ assigned variables and without violating any constraint is at least

acn In n 



12 



■n 



d^ n 1 



r cr n In n 

<- 2 fp) 



n 



12 



n 



exp 



12 



1 



P 



121n(l -p) J 



$= \exp(\Omega(n\ln n))$.

So, intuitively speaking, when solving satisfiable instances of RB/RD near the threshold,
backtrack-style algorithms will very easily fall into pitfalls with no solutions and then, worse still,
take a long time to escape from them. To the best of our knowledge, this is the first result on the
complexity of solving satisfiable instances near the proved threshold, which can help us gain a
better understanding of the extreme hardness of instances in the phase transition region.

For random forced satisfiable instances near the proved threshold, similarly, we have the
following result.

Theorem 6 Given a random forced satisfiable instance of RB/RD with $r_{cr} - \epsilon_r < r < r_{cr}$ or



Per ~ e P < P < Per, where e r = - ln( f_ p) + 



and e*= [1- exp (-2(1- a))]^- 



1 + exp (— -) are two positive constants, we uniformly select without repetition j^n variables, and 
assign each of these variables a value from its domain at random. If these values do not violate any
constraint, then, almost surely, the residual formula is unsatisfiable and has no tree-like resolution
proofs of less than exponential size.

Proof: Due to limited space, we only give the proof for the case of the phase transition with
respect to r in Model RD with $\frac{1}{k} < \alpha < 1$. The other cases can be handled similarly. Assume
that we have two assignments $t_1$ and $t_2$ and that the similarity number [33] between $t_1$ and $t_2$ is
$S^f(\langle t_1, t_2\rangle) = S$. Let P be a random instance of Model RD. Based on the analysis in [33] (p. 96), the
probability that both $t_1$ and $t_2$ satisfy P is

$$\Pr[t_1 \text{ and } t_2 \text{ satisfy } P] = \left[\frac{\binom{S}{k}}{\binom{n}{k}}(1-p) + (1-p)^2\left(1-\frac{\binom{S}{k}}{\binom{n}{k}}\right)\right]^{rn\ln n}.$$



Now we suppose that $t_0$ is a random forced satisfying assignment and t is an assignment with
$S^f(\langle t_0, t\rangle) = S$. Let $P_{sat}$ be a random forced satisfiable formula of Model RD with $t_0$ as the
forced satisfying assignment. Then the probability that t satisfies $P_{sat}$ is

$$\Pr[t \text{ satisfies } P_{sat}] = \frac{\Pr[t_0 \text{ and } t \text{ satisfy } P]}{\Pr[t_0 \text{ satisfies } P]}
= \left[1-p+p\,\frac{\binom{S}{k}}{\binom{n}{k}}\right]^{rn\ln n}
= \left[1-p+p\left(s^k+\frac{g(s)}{n}+O\!\left(\frac{1}{n^2}\right)\right)\right]^{rn\ln n},$$

where $s = S/n$ and $g(s) = \frac{k(k-1)}{2}(s^k - s^{k-1})$. Now, for the random forced satisfiable formula $P_{sat}$, we
uniformly select without repetition $\frac{c}{12}n$ variables and then assign each of these variables a value
from its domain at random. By the standard Chernoff bound, it is easy to show that the similarity
number between the forced satisfying assignment $t_0$ and the random partial assignment to the
$\frac{c}{12}n$ variables is almost surely less than $\frac{c}{6}n^{1-\alpha}$. For the residual formula, we have
$d^{\,n-\frac{c}{12}n}$ possible assignments in total. Let $t'$ be an assignment to the $n-\frac{c}{12}n$ variables of the
residual formula with $S^f(\langle t_0, t'\rangle) = S'$. By assumption, the partial assignment to the $\frac{c}{12}n$ variables
does not violate any constraint. Thus, almost surely, the probability that $t'$ satisfies the residual
formula is at most

rn In n 



6n a 



S'\ k 9 

- O(l)-- 
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6n° 



+ 
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O(l) 
Let E[X] be the expected number of assignments satisfying the residual formula. Similar to the
asymptotic analysis in [33] (p. 99), for $r_{cr} - \epsilon_r < r < r_{cr}$, we have



E[X}< J2 



S"=0 



( n « _ 1)«-T5«-S' 



(in 
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l-p 1- 







a + n , 




1 m\nn 
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tm(l-^) 



12 fc , j 

\ n rnlnn 



™- 12 n 



n 



S' 



rn In n 



n-S' 



for — < a < 1 
k 



Note that the forced satisfying assignment has no effect on the structure of constraint graphs. The 
rest of the proof is identical to that in Theorem 5 and so we are done. □ 
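
The conditional probability used in this proof can be sanity-checked with exact arithmetic. The sketch below is illustrative only: it assumes the per-constraint form of the Model RD probability from [33], namely q(1-p) + (1-p)^2(1-q) with q = C(S,k)/C(n,k), confirms that dividing by 1-p gives 1 - p + pq, and compares q with the expansion s^k + g(s)/n. The parameter values and function names are arbitrary choices, not taken from the paper.

```python
from fractions import Fraction
from math import comb

def pair_prob(S, n, k, p):
    """Per-constraint probability that two assignments with similarity S both satisfy it."""
    q = Fraction(comb(S, k), comb(n, k))
    return q * (1 - p) + (1 - p) ** 2 * (1 - q)

def forced_prob(S, n, k, p):
    """Per-constraint probability that t satisfies it, given that t0 does."""
    return pair_prob(S, n, k, p) / (1 - p)

def g(s, k):
    """First-order correction term g(s) = k(k-1)/2 * (s^k - s^(k-1))."""
    return Fraction(k * (k - 1), 2) * (s ** k - s ** (k - 1))

n, S, k, p = 10_000, 3_000, 3, Fraction(1, 4)
s = Fraction(S, n)
exact = Fraction(comb(S, k), comb(n, k))   # C(S,k) / C(n,k)
approx = s ** k + g(s, k) / n              # expansion up to O(1/n^2)
closed_form = 1 - p + p * exact
print(float(exact), float(approx))
```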
The above theorem, as far as we know, is the first complexity result for resolution-based
algorithms on forced satisfiable instances. It further provides, from another aspect, strong
theoretical support for the method of generating hard satisfiable instances proposed in the last
section.
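
For concreteness, the generation method and the partial-assignment test appearing in Theorems 5 and 6 might be sketched as follows. This is one plausible reading, not the authors' implementation: a hidden assignment is drawn first, each constraint picks k distinct variables, and the round(p d^k) incompatible tuples are sampled from all tuples except the one agreeing with the hidden assignment. The names `forced_rb_instance` and `violates`, the small parameters, and the number of pre-assigned variables are all assumptions made for illustration.

```python
import itertools
import math
import random

def forced_rb_instance(n, alpha, r, p, k=2, seed=0):
    """Sketch of a forced satisfiable Model RB instance (assumed construction)."""
    rng = random.Random(seed)
    d = round(n ** alpha)                   # domain size d = n^alpha
    m = round(r * n * math.log(n))          # number of constraints m = r n ln n
    hidden = [rng.randrange(d) for _ in range(n)]
    num_bad = round(p * d ** k)             # incompatible tuples per constraint
    all_tuples = list(itertools.product(range(d), repeat=k))
    constraints = []
    for _ in range(m):
        scope = tuple(rng.sample(range(n), k))
        protected = tuple(hidden[v] for v in scope)
        candidates = [t for t in all_tuples if t != protected]
        bad = set(rng.sample(candidates, num_bad))
        constraints.append((scope, bad))
    return d, hidden, constraints

def violates(constraints, assignment):
    """True if a (partial) assignment {var: value} violates a fully assigned constraint."""
    for scope, bad in constraints:
        if all(v in assignment for v in scope):
            if tuple(assignment[v] for v in scope) in bad:
                return True
    return False

d, hidden, constraints = forced_rb_instance(
    n=12, alpha=0.8, r=0.8 / math.log(4 / 3), p=0.25)

# Random partial assignment to a few variables, as in the statement of Theorem 6.
rng = random.Random(1)
chosen = rng.sample(range(12), 3)
partial = {v: rng.randrange(d) for v in chosen}
print(violates(constraints, partial))
```

If `violates` returns False for the partial assignment, the remaining constraints restricted to the unassigned variables form the residual formula discussed in the theorem.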



6. Conclusions 

In this paper, by encoding CSPs into CNF formulas, we proved exponential lower bounds for 
tree-like resolution proofs of two random CSP models with exact phase transitions, i.e. Model 
RB/RD. This result suggests that we not only introduce new families of CNF formulas hard for 
resolution, which is a central task of Proof-Complexity theory, but also propose models with both 
many hard instances and exact phase transitions. More interestingly, it is shown both theoretically
and experimentally that an application of RB/RD might be in the generation of hard satisfiable
instances, which is further supported by the exponential lower bounds established in Section 5.

As mentioned before, there are some other NP-complete problems with proved exact phase
transitions, e.g. the Hamiltonian cycle problem and random 2+p-SAT (0 < p < 0.4). However, it has
been shown either experimentally or theoretically that the instances produced by these problems
are generally easy to solve. So one would naturally ask what the main difference between these
"easy" NP-complete problems and RB/RD is. It seems that these "easy" NP-complete problems
with exact phase transitions usually have some kind of local property which can be used to
design polynomial time algorithms working with high probability, and the exact phase transitions
are, in fact, obtained by probabilistic analysis of such algorithms. So, it appears that if a problem
has exact phase transitions obtained by algorithm analysis, then the problem is also
not hard to solve. For RB/RD, however, the situation is completely different. More specifically,
the exact phase transitions of RB/RD are obtained not by analysis of algorithms but by use of the
first and the second moment methods, which say nothing about the local property of the problem
and are, therefore, unlikely to be useful for designing more efficient algorithms. Thus, it seems that
RB/RD, unlike the "easy" NP-complete problems, can indeed provide a reliable source of
random benchmark instances, as many and as hard as we need.

Note that more recently, Frieze and Wormald [15] studied random k-SAT for moderately growing
k, i.e. k = k(n) satisfies $k - \log_2 n \to \infty$, where n is the number of variables. For this model, they
established similarly, by use of the first and the second moment methods, that there exists a
satisfiability threshold at which the number of clauses is $m = 2^k n\ln 2$. From Beame et al.'s earlier
work on the complexity of unsatisfiability proofs for random k-SAT formulas [6, 7], we know that
the size of resolution refutations for this model is exponential with high probability. So, the variant
of random k-SAT studied by Frieze and Wormald is also a model with both proved phase transitions
and many hard instances.
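
As a rough consistency check on the threshold location quoted above (a first-moment heuristic only, not the argument of [15]): a fixed assignment satisfies a random k-clause with probability $1-2^{-k}$, so for m independent random clauses the expected number X of satisfying assignments is

$$E[X] = 2^n\left(1-2^{-k}\right)^m, \qquad E[X] = 1 \iff m = \frac{n\ln 2}{-\ln\left(1-2^{-k}\right)} = 2^k n\ln 2\,\bigl(1-O(2^{-k})\bigr),$$

using $-\ln(1-x) = x + O(x^2)$ with $x = 2^{-k}$. The first moment therefore changes from exponentially large to exponentially small around $m = 2^k n\ln 2$, in agreement with the stated threshold.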

To gain a better understanding of Model RB/RD, we now compare them with
the well-studied random 3-SAT of similar proof complexity. First, we think that the exact phase
transitions are one advantage of RB/RD, since they help us locate the hardest instances
more precisely and conveniently when implementing large-scale computational experiments. As
for the theoretical aspect, it seems that RB/RD are, intrinsically, mathematically much easier to
analyze than random 3-SAT, for example in the derivation of thresholds. From a personal perspective,
we think that such mathematical tractability is another advantage of RB/RD, making it
possible to obtain some interesting results which do not hold or cannot be easily obtained for
random 3-SAT, as shown for forced satisfiable instances.

In summary, the Hamiltonian cycle problem, random 3-SAT and Model RB/RD, respectively,
exhibit three different kinds of phase transition behavior in NP-complete problems. Compared with
the former two, which have been extensively explored in the past decade, the third one (i.e. the phase
transition behavior with both exact thresholds and many hard instances) has, for various reasons,
not received much attention so far. From this point of view, the main contribution of this paper, we
can say, lies not in the mathematical techniques used, nor in the concrete models studied (although
such models are useful for CSP research in their own right), but in pointing out an interesting
behavior for study. Finally, we hope that more investigations, either experimental or theoretical,
will be carried out on this behavior, and we believe that such studies will lead to deep insights and
new discoveries in this active area of research (i.e. phase transitions and computational complexity).
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Appendix 

Now we consider the proof of Lemma 2 for Model RB. Given a variable u and an i-constraint
assignment tuple $T_{i,u}$, it is easy to see that the probability that u is flawed by $T_{i,u}$ increases with
the number of constraints i. Thus we have

$$\Pr(T_{i,u} \text{ is flawed})\big|_{i \le 3k\beta\ln n} \le \Pr(T_{i,u} \text{ is flawed})\big|_{i = 3k\beta\ln n}.$$

For the variable u, there are $d = n^\alpha$ values in its domain, denoted by $v_1, v_2, \ldots, v_d$. Let $\Pr(A_j)$
denote the probability that $v_j$ is not flawed by $T_{i,u}$. Thus the probability that at least one value is
not flawed by $T_{i,u}$, i.e. the probability that the variable u is not flawed by $T_{i,u}$, is

$$\Pr(A_1 \cup A_2 \cup \cdots \cup A_d).$$

Then, by inclusion-exclusion and the symmetry among the values,

$$\Pr(T_{i,u} \text{ is flawed}) = 1 - \Pr(A_1 \cup A_2 \cup \cdots \cup A_d) = 1 + \sum_{j=1}^{d}(-1)^j\binom{d}{j}\Pr(A_1A_2\cdots A_j).$$
Recall that in Model RB, for each constraint, we uniformly select without repetition $pd^k$
incompatible tuples of values and each constraint is generated independently. So we have

$$\Pr(A_1A_2\cdots A_j) = \left[\frac{\binom{d^k-j}{pd^k}}{\binom{d^k}{pd^k}}\right]^{i}
= \left[\frac{(d^k-pd^k)(d^k-pd^k-1)\cdots(d^k-pd^k-j+1)}{d^k(d^k-1)\cdots(d^k-j+1)}\right]^{i}.$$
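
The equality between the binomial ratio and the product form can be verified with exact rational arithmetic for a single constraint. The sketch below is illustrative: `total` stands for $d^k$ and `bad` for $pd^k$, and the concrete values are arbitrary assumptions chosen so that `bad` is an integer.

```python
from fractions import Fraction
from math import comb

def compatible_ratio(total, bad, j):
    """Probability that j fixed tuples all avoid the `bad` incompatible ones: C(total-j, bad) / C(total, bad)."""
    return Fraction(comb(total - j, bad), comb(total, bad))

def compatible_product(total, bad, j):
    """The same probability as a product of j factors (total - bad - t) / (total - t)."""
    result = Fraction(1)
    for t in range(j):
        result *= Fraction(total - bad - t, total - t)
    return result

total, bad = 4 ** 2, 4   # d = 4, k = 2, p = 1/4
checks = [compatible_ratio(total, bad, j) == compatible_product(total, bad, j)
          for j in range(0, 5)]
print(checks)
```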

Note that j < d = n a and k > 2. Now consider the case of i = 3fc/31nn, where (3 
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asymptotic analysis, we have 
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Let H(j) = [1 - JL. Ig^ + (_4)]3fc/31nn_ xhen wg ggt 

Pr(T iiU is flawed) | i=3 fc^ 1„ „ = 1 + 2J-1) J ( • ) Pr (^1^2 • • • ^j)|i=3fc/31nn 
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For < j < ns a , we can easily show that -ff(j) = 1 + o(l). Therefore, 
Pr(T iiU is flawed) |i =3fe/ 3 inn 

n / a\ n / a\ 
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It is easy to verify that 

(T) (n " f ^ ~ (~^ n ~ f )j = e j ~ jlnj+ f jlnn . 
Let -B(j) = j — jlnj + Inn. Differentiating f?(j) with respect to j, we obtain 

(X 4 

S (j) = — Inn — lnj < when j > n^ a . 

4 

So for nT> a < j < n a , we have 

( n °W f ) j < = (-J-)"*" = o(e- Ja ). 

V J / n 10 a 
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Note that H(j) = 0(n C2 ) for ns Q < j < n a , where C2 > is a constant. Hence, 

| £ {-lY^.)( n -SY(H(j) - 1)| < ^ Q J(n-f - 1| 

= 0{n a )0{n c *)o{e- n ^) = o(e" nf ). 

Thus we get 

a 

Pr(T ijU is flawed) | i< 3fe/ 3 in„ < Pr(T; iU is flawed) |i =3 fe/3inn ~ e - ™ 7 . 
The remaining part of the proof is identical to that of Lemma 2 for Model RD, and so we are done. 
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